Skip to content

fix: cache only completed container profiles to ensure data integrity#813

Merged
matthyx merged 8 commits into
mainfrom
fix/cp-cache-completed
May 18, 2026
Merged

fix: cache only completed container profiles to ensure data integrity#813
matthyx merged 8 commits into
mainfrom
fix/cp-cache-completed

Conversation

@matthyx
Copy link
Copy Markdown
Contributor

@matthyx matthyx commented May 18, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Prevents partial or non-terminal profiles from replacing cached entries; entries remain pending until a terminal status (Completed/TooLarge) and are retried safely.
    • Refresh now preserves existing cache when a fetched profile is not terminal.
  • New Features

    • Cache now accepts and promotes entries immediately after a profile reaches Completed, reducing delay.
  • Tests

    • Updated tests and fixtures to reflect the refined completion-status behavior.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Warning

Rate limit exceeded

@matthyx has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 6 minutes and 22 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3307895b-36ef-4243-b6f4-620fc5dbcfc2

📥 Commits

Reviewing files that changed from the base of the PR and between d5dd011 and 9767097.

📒 Files selected for processing (3)
  • pkg/containerprofilemanager/v1/monitoring.go
  • pkg/objectcache/containerprofilecache/containerprofilecache.go
  • pkg/objectcache/containerprofilecache/reconciler_test.go
📝 Walkthrough

Walkthrough

Container profile cache now requires the consolidated ContainerProfile to have a terminal status annotation (Completed or TooLarge) before populating or rebuilding projections; a CompletionNotifier interface and wiring let the profile manager notify the cache, which retries promoting pending entries when a CP becomes completed. Tests and fixtures are updated to reflect completed-status eligibility.

Changes

Container Profile Cache Completion-Only Gating

Layer / File(s) Summary
Notifier wiring and interface
pkg/objectcache/completion_notifier.go, pkg/containerprofilemanager/v1/containerprofile_manager.go, pkg/containerprofilemanager/v1/monitoring.go, cmd/main.go, go.mod
Add CompletionNotifier interface; add completionNotifier field, SetCompletionNotifier setter and notifyCompleted helper to ContainerProfileManager; call notify on monitored completion paths; register the cache as notifier at startup; bump k8s-interface module version.
Completed-Only Gate Implementation
pkg/objectcache/containerprofilecache/containerprofilecache.go, pkg/objectcache/containerprofilecache/reconciler.go
tryPopulateEntry and refreshOneEntry now gate on consolidated CP StatusMetadataKey being terminal (Completed/TooLarge) and keep entries pending otherwise; reconciler comments clarified and minor formatting/alignment edits applied.
NotifyContainerCompleted retry implementation
pkg/objectcache/containerprofilecache/containerprofilecache.go
Add exported NotifyContainerCompleted(containerID string) which retries promoting a pending container entry by calling tryPopulateEntry with bounded attempts and per-attempt timeouts; add helper isTerminalCPStatus.
Tests and fixtures updated for Completed status
pkg/objectcache/containerprofilecache/containerprofilecache_test.go, pkg/objectcache/containerprofilecache/reconciler_test.go, pkg/objectcache/containerprofilecache/t8_overlay_refresh_test.go
Update many ContainerProfile fixtures to include StatusMetadataKey: Completed; replace/adjust tests to assert pending behavior for non-Completed CPs and promotion after transition to Completed; add required imports for helpers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • kubescape/node-agent#788: Earlier refactor that established the container profile cache and reconciler/pending retry baseline; relevant to this completed-only gating and promotion flow.

Poem

🐰 I nudged the cache to wait till done,
Not partial shadows, not half-run fun,
When profiles finish, I softly tap—
Retry, promote, close the gap. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: cache only completed container profiles to ensure data integrity' clearly and specifically summarizes the main change: implementing logic to cache only completed container profiles rather than partial ones.
Docstring Coverage ✅ Passed Docstring coverage is 89.47% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cp-cache-completed

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

slashben
slashben previously approved these changes May 18, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/objectcache/containerprofilecache/containerprofilecache.go`:
- Around line 365-373: The code currently sets cp = nil when the retrieved
container profile's status is not Completed, but later the synthetic-CP fallback
(the overlay-based construction logic) can still populate the cache; change the
flow so a non-Completed real CP prevents any caching/overlay fallback. In
tryPopulateEntry, detect when cp exists but its
annotations[helpersv1.StatusMetadataKey] != helpersv1.Completed and either
return early or set a guard (e.g., skipCaching or allowFallback=false) so the
synthetic-CP creation path does not run; ensure the synthetic-CP/overlay logic
(the fallback starting where overlays are merged) checks that guard or that an
original cp was Completed before creating/adding an entry. Use the cp variable
and the synthetic-CP fallback code paths as the points to modify.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f2a84457-8e70-4481-9d49-6f655e7308c6

📥 Commits

Reviewing files that changed from the base of the PR and between cc59fa0 and 22f2e68.

📒 Files selected for processing (5)
  • pkg/objectcache/containerprofilecache/containerprofilecache.go
  • pkg/objectcache/containerprofilecache/containerprofilecache_test.go
  • pkg/objectcache/containerprofilecache/reconciler.go
  • pkg/objectcache/containerprofilecache/reconciler_test.go
  • pkg/objectcache/containerprofilecache/t8_overlay_refresh_test.go

Comment thread pkg/objectcache/containerprofilecache/containerprofilecache.go Outdated
@github-actions
Copy link
Copy Markdown

Performance Benchmark Results

Node-Agent Resource Usage
Metric BEFORE AFTER Delta
Avg CPU (cores) 0.213 0.204 -4.2%
Peak CPU (cores) 0.222 0.223 +0.4%
Avg Memory (MiB) 361.802 267.182 -26.2%
Peak Memory (MiB) 364.680 274.684 -24.7%
Dedup Effectiveness (AFTER only)
Event Type Passed Deduped Ratio
capabilities 1 0 0.0%
hardlink 6000 0 0.0%
http 1708 119455 98.6%
network 901 78002 98.9%
open 35017 621316 94.7%
symlink 6000 0 0.0%
syscall 984 1885 65.7%
Event Counters
Metric BEFORE AFTER
capability_counter 9 10
dns_counter 1461 1444
exec_counter 7337 7264
network_counter 96465 95461
open_counter 803468 795369
syscall_counter 3596 3610

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
@github-actions
Copy link
Copy Markdown

Performance Benchmark Results

Node-Agent Resource Usage
Metric BEFORE AFTER Delta
Avg CPU (cores) 0.204 0.199 -2.5%
Peak CPU (cores) 0.215 0.209 -2.9%
Avg Memory (MiB) 321.011 268.534 -16.3%
Peak Memory (MiB) 322.906 276.281 -14.4%
Dedup Effectiveness (AFTER only)
Event Type Passed Deduped Ratio
capabilities 0 0 N/A
hardlink 6000 0 0.0%
http 1763 119395 98.5%
network 905 77920 98.9%
open 36293 619954 94.5%
symlink 6000 0 0.0%
syscall 991 1884 65.5%
Event Counters
Metric BEFORE AFTER
capability_counter 11 8
dns_counter 1442 1413
exec_counter 7212 7103
network_counter 94837 93362
open_counter 789854 779233
syscall_counter 3630 3470

Instead of polling pending containers every 5s (6× storage pressure),
wire a CompletionNotifier from containerprofilemanager to the CP cache.
NotifyContainerCompleted is called on ContainerReachedMaxTime and
ObjectCompletedError, launching a bounded retry goroutine (5 × 3s) that
promotes the pending entry as soon as storage consolidation completes.
The 30s reconciler tick retains retryPendingEntries as a safety net.

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@matthyx matthyx force-pushed the fix/cp-cache-completed branch from 2814e43 to 9b767e3 Compare May 18, 2026 12:06
…tion window

Storage consolidation runs every 30s; the original 5×3s=15s retry window
expired before consolidation completed, leaving the container in pending
until the next 30s reconciler tick. Increasing to 20×3s=60s ensures at
least one retry fires after consolidation, closing the race with
Test_12_MergingProfilesTest's 10s post-completion sleep.

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Performance Benchmark Results

Node-Agent Resource Usage
Metric BEFORE AFTER Delta
Avg CPU (cores) 0.199 0.000 -100.0%
Peak CPU (cores) 0.235 0.000 -100.0%
Avg Memory (MiB) 320.104 0.000 -100.0%
Peak Memory (MiB) 323.027 0.000 -100.0%
Dedup Effectiveness

No data available.

Event Counters
Metric BEFORE AFTER
capability_counter 10 0
dns_counter 56 0
exec_counter 7230 0
network_counter 81588 0
open_counter 791049 0
syscall_counter 3533 0

@github-actions
Copy link
Copy Markdown

Performance Benchmark Results

Node-Agent Resource Usage
Metric BEFORE AFTER Delta
Avg CPU (cores) 0.210 0.213 +1.5%
Peak CPU (cores) 0.221 0.227 +2.9%
Avg Memory (MiB) 346.896 266.175 -23.3%
Peak Memory (MiB) 352.250 278.340 -21.0%
Dedup Effectiveness (AFTER only)
Event Type Passed Deduped Ratio
capabilities 1 0 0.0%
hardlink 6000 0 0.0%
http 1761 119398 98.5%
network 902 77840 98.9%
open 36139 620116 94.5%
symlink 6000 0 0.0%
syscall 975 1879 65.8%
Event Counters
Metric BEFORE AFTER
capability_counter 9 8
dns_counter 358 1384
exec_counter 7237 7105
network_counter 84569 93062
open_counter 791834 779429
syscall_counter 3534 3481

matthyx and others added 2 commits May 18, 2026 15:17
TooLarge is a terminal state (manager stopped collecting; data is
truncated but valid for detection). The Completed-only gate in
tryPopulateEntry and refreshOneEntry incorrectly kept TooLarge
containers pending indefinitely since they never transition to
Completed. Introduce isTerminalCPStatus(Completed|TooLarge) and
apply it in both code paths. Add TestTooLargeCP_Accepted covering
the addContainer and refresh paths.

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire notifyCompleted into the ObjectTooLargeError path in
handleSaveProfileError so pending containers promoted to TooLarge
get the same immediate cache-promotion goroutine as Completed ones.
Without this, a TooLarge container stayed in the pending map for up
to 30s (next reconciler tick) even though a valid truncated CP was
already in storage. Add TestNotifyContainerTerminal_TooLarge to
assert the fast-path promotion fires without waiting for the tick.

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Performance Benchmark Results

Node-Agent Resource Usage
Metric BEFORE AFTER Delta
Avg CPU (cores) 0.194 0.000 -100.0%
Peak CPU (cores) 0.201 0.000 -100.0%
Avg Memory (MiB) 328.767 0.000 -100.0%
Peak Memory (MiB) 331.152 0.000 -100.0%
Dedup Effectiveness

No data available.

Event Counters
Metric BEFORE AFTER
capability_counter 10 0
dns_counter 1450 0
exec_counter 7253 0
network_counter 95430 0
open_counter 796095 0
syscall_counter 3514 0

When a container exits normally the monitorContainer loop handles
ContainerHasTerminatedError: it saves the final CP (Status=Completed
set by lifecycle.go) but previously never called notifyCompleted, so
pending cache entries were only promoted on the next 30s reconciler
tick. Fire notifyCompleted when the terminal status is Completed so
the fast-path promotion goroutine runs immediately.

Add TestNotifyContainerTerminal_Completed: pending → Completed via
notification, verified without waiting for the periodic tick.

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Performance Benchmark Results

Node-Agent Resource Usage
Metric BEFORE AFTER Delta
Avg CPU (cores) 0.000 0.000 N/A
Peak CPU (cores) 0.000 0.000 N/A
Avg Memory (MiB) 0.000 0.000 N/A
Peak Memory (MiB) 0.000 0.000 N/A
Dedup Effectiveness

No data available.

goradd/maps.SafeMap.Load() reads m.items == nil without a lock while
Set() initializes m.items under a write lock, causing a data race on
the first concurrent access to a zero-value SafeMap. Pre-initialize
entries and pending in NewContainerProfileCache so m.items is non-nil
before any goroutine can call Load concurrently.

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Performance Benchmark Results

Node-Agent Resource Usage
Metric BEFORE AFTER Delta
Avg CPU (cores) 0.199 0.000 -100.0%
Peak CPU (cores) 0.207 0.000 -100.0%
Avg Memory (MiB) 319.202 0.000 -100.0%
Peak Memory (MiB) 325.180 0.000 -100.0%
Dedup Effectiveness

No data available.

Event Counters
Metric BEFORE AFTER
capability_counter 12 0
dns_counter 1433 0
exec_counter 7240 0
network_counter 95103 0
open_counter 791652 0
syscall_counter 3528 0

@matthyx matthyx added the release Create release label May 18, 2026
@matthyx matthyx merged commit e4f6ec9 into main May 18, 2026
29 of 30 checks passed
@matthyx matthyx deleted the fix/cp-cache-completed branch May 18, 2026 14:44
@github-actions
Copy link
Copy Markdown

Performance Benchmark Results

Node-Agent Resource Usage
Metric BEFORE AFTER Delta
Avg CPU (cores) 0.210 0.200 -4.6%
Peak CPU (cores) 0.221 0.215 -2.8%
Avg Memory (MiB) 342.993 269.575 -21.4%
Peak Memory (MiB) 349.398 276.855 -20.8%
Dedup Effectiveness (AFTER only)
Event Type Passed Deduped Ratio
capabilities 2 0 0.0%
hardlink 6000 0 0.0%
http 1764 119396 98.5%
network 900 78000 98.9%
open 34933 621315 94.7%
symlink 6000 0 0.0%
syscall 986 1892 65.7%
Event Counters
Metric BEFORE AFTER
capability_counter 8 10
dns_counter 1442 1421
exec_counter 7265 7111
network_counter 95432 93512
open_counter 795347 779408
syscall_counter 3627 3467

@matthyx matthyx moved this to To Archive in KS PRs tracking May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release Create release

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants